Introduction

On the surface, this assignment is meant to make sure you’re familiar with plotting spatial data. In practice, however, the bulk of your time will most likely go to wrangling and reshaping the data so that it’s ready to be graphed. As we move into the final stretch of the class, the hints will become sparser. As with all the previous homework assignments, there’s no need to look up fancy packages or techniques: unless stated otherwise, everything can be done with the tools we already have.

The Data

The data are in the form that they were originally collected (except someone was nice enough to gather all the lat/long coordinates of the zip codes for you).

The data come from a Dialect Survey conducted by Bert Vaux. Some limited information can be found at the original, now-deprecated website http://www4.uwm.edu/FLL/linguistics/dialect/index.html. Although 122 questions were asked in the survey, the subset of the data provided to you contains answers only to the 67 questions that focused on lexical rather than phonetic differences.

There are three files included in this assignment:

Note that while lingData.txt and lingLocation.txt encode the same underlying responses, their rows represent different observational units. For example, say John and Paul take this questionnaire for two questions. The first question has three answer choices and the second has four. If John answered A and D and Paul answered B and D, then lingData would encode two vectors: (1, 4) and (2, 4). If they lived in the same longitude/latitude box, lingLocation would encode their responses as one count vector: (1, 1, 0, 0, 0, 0, 2).
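
The John-and-Paul example above can be sketched in base R (the toy objects below are illustrative, not the real file layouts):

```r
# lingData: one row per respondent, one column per question, answers stored
# as the index of the chosen option (1 = "a", 2 = "b", ...).
lingData_toy <- data.frame(
  ID = c("John", "Paul"),
  Q1 = c(1, 2),   # John chose option a, Paul chose option b (3 choices)
  Q2 = c(4, 4)    # both chose option d (4 choices)
)

# lingLocation: one row per lat/long cell, one column per answer choice,
# holding the count of respondents in that cell who picked each choice.
# With 3 + 4 = 7 total choices across the two questions:
lingLocation_toy <- c(
  tabulate(lingData_toy$Q1, nbins = 3),  # counts for Q1's three choices
  tabulate(lingData_toy$Q2, nbins = 4)   # counts for Q2's four choices
)
lingLocation_toy
# 1 1 0 0 0 0 2  -- matching the vector in the text
```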

You’ll need read_delim from the readr package to read in the last two files. Remember to specify the delim argument, which tells readr how fields are separated in the text file.

Task 0

Explore and clean the data. Document what was added/removed, explaining your actions.

lingData = read_delim("lingData.txt", delim = " ")
lingLocation = read_delim("lingLocation.txt", delim = " ")
load("~/Documents/Stat_133/stat133/hw7/question_data.RData")
ans_data = do.call(rbind, all.ans)

fix_tie = function(x){
  # Collapse tied answers into one "/"-separated label, e.g. "yall/you guys".
  str_c(x, collapse = "/")
}


# Lookup vectors for str_replace_all: "1" -> "a", ..., and "CA" -> "California", etc.
names(letters) = 1:26
names(state.name) = state.abb


states = map_data("state") %>% 
  rename(STATE = region) %>% 
  select(STATE, long, lat, group) %>% 
  mutate(STATE = str_to_title(STATE))


new_lingData = lingData %>% 
  gather(question_no, observation, -ID, -CITY, -STATE, -ZIP, -lat, -long) %>% 
  mutate(qnum = parse_number(question_no),                  # "Q050" -> 50
         ans.let = str_replace_all(observation, letters),   # answer index -> letter
         STATE = str_replace_all(STATE, state.name)) %>%    # "CA" -> "California"
  left_join(ans_data) %>% 
  select(-lat, -long) %>% 
  left_join(quest.use)


modified_new_lingData = new_lingData %>% 
  filter(STATE %in% state.name) %>% 
  filter(!STATE %in% c("Hawaii", "Alaska")) %>%   # continental US only
  group_by(qnum, STATE, ans) %>% 
  tally() %>% 
  filter(n == max(n)) %>%           # keep the most common answer(s); ties survive
  group_by(quest, STATE) %>% 
  mutate(ans = fix_tie(ans)) %>%    # collapse ties into "answer1/answer2"
  unique()

final_lingData = inner_join(states, modified_new_lingData)

write_csv(final_lingData, path = "usa_survey.csv")

We first gather all the question-number columns of lingData into a single column called question_no. From it we derive qnum, which parses the question numbers from strings to numbers, and ans.let, which converts the numeric answer codes in the observation column to letters. We also expand the STATE abbreviations to full state names. We then left join with the ans_data dataframe, which maps question numbers and answer letters to the answer text, drop the per-respondent latitude and longitude columns, and left join with the quest.use data frame to attach the question wording. We further clean the data by keeping only the continental US, dropping Hawaii and Alaska. Tallying the answers for each question and state gives a count column n; keeping only the rows where n equals its maximum leaves the most common answer(s) per state, and the fix_tie function collapses the special cases where several answers tie into a single label. Finally, we join this cleaned data set with the states data set to attach the latitude and longitude polygons of the states.
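
The tie-handling step can be sketched in base R (the counts below are made up for illustration):

```r
# Toy tallies for one question in one state: counts of each answer.
counts <- c(yall = 10, "you guys" = 10, youse = 3)

# Keep every answer that achieves the maximum count (possibly several)...
top <- names(counts)[counts == max(counts)]

# ...and collapse any tie into a single "/"-separated label, as fix_tie does.
paste(top, collapse = "/")
# "yall/you guys"
```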

Task 1

Implement a Shiny app that colors a map of the continental US based on the most common answer for each state. The user should be able to pick one of the 67 questions from a dropdown menu. If a state has two or more tied answers, the map should show the tie as a distinct color. A static example with a tie in West Virginia is shown below:

As with homework 6, include your server and ui code below along with a link to your app on shinyapps.io.

library(shiny)
library(readr)

usa_survey = read_csv("usa_survey.csv")

shinyUI(fluidPage(
  titlePanel("Most Common Answer to Survey Question for Each State"),
  
  sidebarLayout(
    sidebarPanel(
      
      selectInput("question",
                  label = "Choose a question:",
                  choices = na.omit(unique(usa_survey$quest)))
    ),
    mainPanel(plotOutput("usa"))
  )
))
library(shiny)
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(ggmap)
library(mapdata)
library(stringr)

usa_survey = read_csv("usa_survey.csv")

shinyServer(function(input, output) {
  output$usa = renderPlot({
    
    usa_survey = usa_survey %>% 
      filter(quest %in% input$question)
    
    
    ggplot(usa_survey) +
      geom_polygon(aes(x = long, y = lat, fill = ans, group = group), 
                   color = "black") +
      coord_fixed(1.3) +
      labs(title = str_wrap(input$question)) +
      theme_void() + 
      scale_fill_discrete(name = "Answer",
                          labels = str_wrap(levels(factor(usa_survey$ans)), width = 20)) 
      
      
  })
})

Change the url to the link to your app

Task 2

Make visualization(s) of the lingLocation data for two questions that you found interesting. Remember that each row represents a 1-degree by 1-degree square centered at the given lat/long coordinate.
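
To build intuition for the grid, here is a minimal sketch of mapping a respondent's coordinates to the center of their 1-degree cell. The floor(x) + 0.5 convention is an assumption about how the cells were constructed; check it against the Latitude/Longitude values in the file.

```r
# Map a point to the center of its 1-degree x 1-degree grid cell
# (binning convention assumed, not taken from the data documentation).
cell_center <- function(lat, long) {
  c(Latitude = floor(lat) + 0.5, Longitude = floor(long) + 0.5)
}

cell_center(37.87, -122.27)   # Berkeley, CA
# Latitude 37.5, Longitude -122.5
```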

lingData = read_delim("lingData.txt", delim = " ")
lingLocation = read_delim("lingLocation.txt", delim = " ")
load("~/Documents/Stat_133/stat133/hw7/question_data.RData")
ans_data = do.call(rbind, all.ans)

# Build join keys like "50a" that match the renamed lingLocation columns below.
ans_data50 = ans_data %>% 
  filter(qnum == 50) %>% 
  unite(num.let, qnum, ans.let, sep = "")

ans_data51 = ans_data %>% 
  filter(qnum == 51) %>% 
  unite(num.let, qnum, ans.let, sep = "")

 
data50 = lingLocation %>% 
  rename("50a" = V4, "50b" = V5, "50c" = V6, "50d" = V7, "50e" = V8,
         "50f" = V9, "50g" = V10, "50h" = V11, "50i" = V12) %>% 
  select(-c(V13:V471)) %>% 
  gather(num.let, 
         obs, 
         -Cell, 
         -`Number of people in cell`, 
         -Latitude, 
         -Longitude) %>% 
  left_join(ans_data50) %>% 
  group_by(Cell) %>%
  filter(obs == max(obs))   # most common answer(s) to question 50 in each cell
## Joining by: "num.let"
data51 = lingLocation %>% 
  rename("51a" = V13, "51b" = V14, "51c" = V15) %>%
  select(-c(V4:V12), -c(V16:V471)) %>% 
  gather(num.let, 
         obs, 
         -Cell, 
         -`Number of people in cell`, 
         -Latitude, 
         -Longitude) %>% 
  left_join(ans_data51) %>% 
  group_by(Cell) %>% 
  filter(obs == max(obs))   # most common answer(s) to question 51 in each cell
## Joining by: "num.let"
usa_data = map_data("usa")
usa_bbox = make_bbox(lat = lat, lon = long, data = usa_data)
usa_map = get_map(location = usa_bbox, source = "google", maptype = "hybrid")
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.25658,-95.844379&zoom=4&size=640x640&scale=2&maptype=hybrid&language=en-EN&sensor=false
ggmap(usa_map) +
  geom_point(data = data50,
             mapping = aes(x = Longitude, y = Latitude, color = ans),
             size = 0.5) +
  labs(title = "What word(s) do you use to address a group of two or more people?")
## Warning: Removed 40 rows containing missing values (geom_point).

ggmap(usa_map) +
  geom_point(data = data51,
             mapping = aes(x = Longitude, y = Latitude, color = ans),
             size = 0.5) +
  labs(title = "Would you say 'Are you coming with?' as a full sentence to mean 'Are you coming with us?'")
## Warning: Removed 38 rows containing missing values (geom_point).